library(relaimpo)
Loading required package: MASS
package ‘MASS’ was built under R version 3.6.2
Attaching package: ‘MASS’
The following object is masked from ‘package:dplyr’:
select
Loading required package: boot
Loading required package: survey
package ‘survey’ was built under R version 3.6.2
Loading required package: grid
Loading required package: Matrix
Attaching package: ‘Matrix’
The following objects are masked from ‘package:tidyr’:
expand, pack, unpack
Loading required package: survival
Attaching package: ‘survival’
The following object is masked from ‘package:boot’:
aml
Attaching package: ‘survey’
The following object is masked from ‘package:graphics’:
dotchart
Loading required package: mitools
This is the global version of package relaimpo.
If you are a non-US user, a version with the interesting additional metric pmvd is available
from Ulrike Groempings web site at prof.beuth-hochschule.de/groemping.
house_price <- read_csv(here("data/kc_house_data.csv"))
glimpse(house_price)
house_price <- house_price %>%
select(-c(date, id, sqft_living15, sqft_lot15, zipcode))
summary(house_price)
     price            bedrooms        bathrooms      sqft_living
 Min.   :  75000   Min.   : 0.000   Min.   :0.000   Min.   :  290
 1st Qu.: 321950   1st Qu.: 3.000   1st Qu.:1.750   1st Qu.: 1427
 Median : 450000   Median : 3.000   Median :2.250   Median : 1910
 Mean   : 540088   Mean   : 3.371   Mean   :2.115   Mean   : 2080
 3rd Qu.: 645000   3rd Qu.: 4.000   3rd Qu.:2.500   3rd Qu.: 2550
 Max.   :7700000   Max.   :33.000   Max.   :8.000   Max.   :13540
    sqft_lot           floors      waterfront          view
 Min.   :    520   Min.   :1.000   Mode :logical   Min.   :0.0000
 1st Qu.:   5040   1st Qu.:1.000   FALSE:21450     1st Qu.:0.0000
 Median :   7618   Median :1.500   TRUE :163       Median :0.0000
 Mean   :  15107   Mean   :1.494                   Mean   :0.2343
 3rd Qu.:  10688   3rd Qu.:2.000                   3rd Qu.:0.0000
 Max.   :1651359   Max.   :3.500                   Max.   :4.0000
   condition         grade          sqft_above   sqft_basement       yr_built
 Min.   :1.000   Min.   : 1.000   Min.   : 290   Min.   :   0.0   Min.   :1900
 1st Qu.:3.000   1st Qu.: 7.000   1st Qu.:1190   1st Qu.:   0.0   1st Qu.:1951
 Median :3.000   Median : 7.000   Median :1560   Median :   0.0   Median :1975
 Mean   :3.409   Mean   : 7.657   Mean   :1788   Mean   : 291.5   Mean   :1971
 3rd Qu.:4.000   3rd Qu.: 8.000   3rd Qu.:2210   3rd Qu.: 560.0   3rd Qu.:1997
 Max.   :5.000   Max.   :13.000   Max.   :9410   Max.   :4820.0   Max.   :2015
  yr_renovated         lat             long
 Min.   :   0.0   Min.   :47.16   Min.   :-122.5
 1st Qu.:   0.0   1st Qu.:47.47   1st Qu.:-122.3
 Median :   0.0   Median :47.57   Median :-122.2
 Mean   :  84.4   Mean   :47.56   Mean   :-122.2
 3rd Qu.:   0.0   3rd Qu.:47.68   3rd Qu.:-122.1
 Max.   :2015.0   Max.   :47.78   Max.   :-121.3
We should convert waterfront to a logical vector and treat it as a categorical variable.
house_price <- house_price %>%
mutate_at("waterfront", as.logical)
# Flag houses that have ever been renovated (yr_renovated is 0 when they have not)
house_price <- house_price %>%
  mutate(renovated = yr_renovated != 0) %>%
  select(-yr_renovated)
unique(house_price$condition)
[1] 3 5 4 1 2
unique(house_price$grade)
[1] 7 6 8 11 9 5 10 12 4 3 13 1
Grade classifies a house according to the quality of the materials used to build it, so a higher grade means a higher cost per unit of area. In this sense we can treat grade as an ordinal categorical variable: its values imply an order, but the gaps between them are not equal numerical intervals. The same applies to condition.
house_price <- house_price %>%
mutate_at("condition", as.factor) %>%
mutate_at("grade", as.factor)
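To make the effect of this conversion concrete, here is a minimal sketch on a small made-up data frame (the `toy` data and its values are hypothetical, not taken from the King County file). It shows that a numeric grade gets a single slope, while a factor grade gets one dummy coefficient per level, each measured against the first level.

```r
# Hypothetical toy data, only to illustrate numeric vs factor coding
toy <- data.frame(
  price = c(200, 250, 310, 330, 420, 450),
  grade = c(1, 1, 2, 2, 3, 3)
)

# Numeric coding: one slope, so every step in grade is worth the same amount
fit_numeric <- lm(price ~ grade, data = toy)

# Factor coding: one coefficient per level, each compared with grade 1
toy$grade_f <- as.factor(toy$grade)
fit_factor <- lm(price ~ grade_f, data = toy)

# The design matrix shows the dummy columns that lm() builds for the factor
head(model.matrix(fit_factor))

length(coef(fit_numeric))  # intercept plus one slope
length(coef(fit_factor))   # intercept plus (number of levels - 1) dummies
```

Note that `as.factor()` produces an unordered factor, so the model does not use the ordering of the levels; it only allows each level its own effect.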
mod_pre <- lm(price ~ ., data = house_price)
alias(mod_pre)
Model :
price ~ bedrooms + bathrooms + sqft_living + sqft_lot + floors +
waterfront + view + condition + grade + sqft_above + sqft_basement +
yr_built + lat + long + renovated
Complete :
(Intercept) bedrooms bathrooms sqft_living sqft_lot floors
sqft_basement 0 0 0 1 0 0
waterfrontTRUE view condition2 condition3 condition4 condition5
sqft_basement 0 0 0 0 0 0
grade3 grade4 grade5 grade6 grade7 grade8 grade9 grade10 grade11
sqft_basement 0 0 0 0 0 0 0 0 0
grade12 grade13 sqft_above yr_built lat long renovatedTRUE
sqft_basement 0 0 -1 0 0 0 0
Nonzero entries in the “Complete” matrix show which terms are linearly dependent on others: here sqft_basement = sqft_living - sqft_above, so these three variables are exactly collinear. Linear dependence is stronger than high correlation: terms can be highly correlated without being linearly dependent.
So I will drop sqft_basement, sqft_living and sqft_above because they are linearly dependent, if I have interpreted the output correctly. (Dropping just one of the three would already break the dependency.)
house_price <- house_price %>%
select(-c(sqft_basement, sqft_living, sqft_above))
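The alias check can be reproduced on simulated data. This is only a sketch: the `toy` data frame and its column names are made up, built so that one column is the exact sum of two others, which mirrors the relationship between the three sqft variables.

```r
set.seed(1)

# Hypothetical data where living = above + basement holds exactly
toy <- data.frame(
  above    = runif(50, 500, 3000),
  basement = runif(50, 0, 1000)
)
toy$living <- toy$above + toy$basement
toy$price  <- 100 * toy$living + rnorm(50, sd = 5000)

# With all three predictors, lm() cannot estimate the last one
fit_all <- lm(price ~ living + above + basement, data = toy)
coef(fit_all)             # the basement coefficient is NA
alias(fit_all)$Complete   # basement expressed as 1 * living - 1 * above

# Removing a single variable is enough to get a full set of estimates
fit_reduced <- lm(price ~ living + above, data = toy)
anyNA(coef(fit_reduced))
```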
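# select_if() needs a function, so negate is.numeric inside one
house_price_numeric <- house_price %>%
  select_if(function(x) is.numeric(x))
house_price_no_numetic <- house_price %>%
  select_if(function(x) !is.numeric(x))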
house_price_numeric %>%
ggpairs()
house_price_no_numetic %>%
ggpairs()
model_price_bath <- lm(price ~ bathrooms, data = house_price)
summary(model_price_bath)
Call:
lm(formula = price ~ bathrooms, data = house_price)
Residuals:
Min 1Q Median 3Q Max
-1438157 -184525 -41525 113220 5925322
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 10708 6211 1.724 0.0847 .
bathrooms 250326 2760 90.714 <2e-16 ***
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
Residual standard error: 312400 on 21611 degrees of freedom
Multiple R-squared: 0.2758, Adjusted R-squared: 0.2757
F-statistic: 8229 on 1 and 21611 DF, p-value: < 2.2e-16
On its own, the number of bathrooms explains only about 27.6% of the variance in price (adjusted R-squared 0.2757).
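For reference, this is how the R-squared and adjusted R-squared figures are built from the residuals. The sketch uses the built-in mtcars data as a stand-in, so the numbers are not those of the house model.

```r
# Stand-in model on a built-in dataset
fit <- lm(mpg ~ wt, data = mtcars)

# R-squared: share of the total variation that the residuals do not contain
rss <- sum(residuals(fit)^2)
tss <- sum((mtcars$mpg - mean(mtcars$mpg))^2)
r2  <- 1 - rss / tss

# Adjusted R-squared penalises the number of predictors p
n <- nrow(mtcars)
p <- 1
adj_r2 <- 1 - (1 - r2) * (n - 1) / (n - p - 1)

# Both agree with what summary() reports
all.equal(r2, summary(fit)$r.squared)
all.equal(adj_r2, summary(fit)$adj.r.squared)
```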
par(mfrow = c(2,2))
plot(model_price_bath)
The scale-location plot shows an upward trend, and the normal Q-Q plot shows that the residuals depart from normality at the extremes.
house_price %>%
add_residuals(model_price_bath) %>%
select_if(function(x) is.numeric(x)) %>%
select(-c(price, bathrooms)) %>%
ggpairs()
View is highly correlated with the residuals, so let's check it in combination with bathrooms.
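The same screening can also be done numerically rather than read off the pairs plot. A minimal sketch, again with mtcars standing in for the house data and an arbitrary set of candidate columns:

```r
# Stand-in for the first model
fit <- lm(mpg ~ wt, data = mtcars)

# Candidate predictors that are not in the model yet (arbitrary choice)
candidates <- mtcars[, c("hp", "qsec", "drat", "disp")]

# Correlation of each candidate with the residuals of the current model
res_cor <- sapply(candidates, function(x) cor(x, residuals(fit)))

# Largest absolute correlation first: the most promising next predictor
sort(abs(res_cor), decreasing = TRUE)
```

This only ranks numeric candidates by linear association with the residuals, so it is a guide for what to try next rather than a selection rule.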
model_price_bath_view <- lm(price ~ bathrooms + view, data = house_price)
summary(model_price_bath_view)
Call:
lm(formula = price ~ bathrooms + view, data = house_price)
Residuals:
Min 1Q Median 3Q Max
-1254186 -169132 -34786 113486 5729504
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 34550 5816 5.94 2.89e-09 ***
bathrooms 222618 2624 84.84 < 2e-16 ***
view 148332 2637 56.25 < 2e-16 ***
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
Residual standard error: 291800 on 21610 degrees of freedom
Multiple R-squared: 0.3683, Adjusted R-squared: 0.3682
F-statistic: 6298 on 2 and 21610 DF, p-value: < 2.2e-16
Together, bathrooms and view explain about 36.8% of the variance in price.
par(mfrow = c(2,2))
plot(model_price_bath_view)
The residuals still show a trend and still depart from normality at the extremes.
Now let's look for a useful predictor among our categorical variables.
house_price %>%
add_residuals(model_price_bath_view) %>%
select(waterfront, condition, grade, renovated, resid) %>%
ggpairs()
Condition looks quite interesting, especially around level 3.
model_bath_view_condition <- lm(price ~ bathrooms + view + condition, data = house_price)
summary(model_bath_view_condition)
Call:
lm(formula = price ~ bathrooms + view + condition, data = house_price)
Residuals:
Min 1Q Median 3Q Max
-1267263 -169667 -32811 113469 5740221
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 22246 53084 0.419 0.675
bathrooms 228574 2672 85.544 <2e-16 ***
view 145369 2631 55.248 <2e-16 ***
condition2 -37180 57440 -0.647 0.517
condition3 -19290 53128 -0.363 0.717
condition4 26272 53170 0.494 0.621
condition5 80272 53508 1.500 0.134
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
Residual standard error: 290300 on 21606 degrees of freedom
Multiple R-squared: 0.3751, Adjusted R-squared: 0.3749
F-statistic: 2161 on 6 and 21606 DF, p-value: < 2.2e-16
To be honest, the adjusted R-squared increases by less than 1 percentage point (from 0.3682 to 0.3749). Let's check the ANOVA.
anova(model_price_bath_view, model_bath_view_condition)
Analysis of Variance Table
Model 1: price ~ bathrooms + view
Model 2: price ~ bathrooms + view + condition
Res.Df RSS Df Sum of Sq F Pr(>F)
1 21610 1.8402e+15
2 21606 1.8204e+15 4 1.9854e+13 58.912 < 2.2e-16 ***
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
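The F statistic in this table can be recomputed by hand from the printed values, which makes it clearer what the test compares. The numbers below are copied from the table; because they are rounded, the result matches the printed F only approximately.

```r
# Values copied from the ANOVA table above (rounded as printed)
sum_sq  <- 1.9854e13   # drop in residual sum of squares when condition is added
rss_big <- 1.8204e15   # residual sum of squares of the larger model
df_extra <- 4          # extra coefficients: condition2 to condition5
df_big   <- 21606      # residual degrees of freedom of the larger model

# F = (improvement per extra coefficient) / (residual variance of larger model)
f_stat <- (sum_sq / df_extra) / (rss_big / df_big)
f_stat

# Upper-tail probability of the F distribution gives the p-value
p_value <- pf(f_stat, df_extra, df_big, lower.tail = FALSE)
p_value
```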
The model that includes condition fits significantly better (the ANOVA p-value is below 2.2e-16, far under 0.05), but before adding condition let's try another non-categorical variable and see what happens.
house_price %>%
add_residuals(model_price_bath_view) %>%
select_if(function(x) is.numeric(x)) %>%
select(-c(price, bathrooms, view)) %>%
ggpairs()
Lat is still highly correlated with the residuals, so let's include it.
model_bath_view_lat <- lm(price ~ bathrooms + view + lat,
data = house_price)
summary(model_bath_view_lat)
Call:
lm(formula = price ~ bathrooms + view + lat, data = house_price)
Residuals:
Min 1Q Median 3Q Max
-1261035 -137379 -29702 89832 5667993
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) -36980261 633204 -58.40 <2e-16 ***
bathrooms 219219 2439 89.88 <2e-16 ***
view 148107 2451 60.44 <2e-16 ***
lat 778428 13316 58.46 <2e-16 ***
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
Residual standard error: 271200 on 21609 degrees of freedom
Multiple R-squared: 0.4545, Adjusted R-squared: 0.4544
F-statistic: 6002 on 3 and 21609 DF, p-value: < 2.2e-16
With lat the adjusted R-squared rises by almost 9 percentage points (from 0.3682 to 0.4544). Let's check the diagnostic plots.
par(mfrow = c(2,2))
plot(model_bath_view_lat)
The diagnostic plots look much the same as before, so let's try modelling log(price) instead.
model_bath_view_lat_1 <- lm(log(price) ~ bathrooms + view + lat,
data = house_price)
summary(model_bath_view_lat_1)
Call:
lm(formula = log(price) ~ bathrooms + view + lat, data = house_price)
Residuals:
Min 1Q Median 3Q Max
-1.85904 -0.22650 -0.01743 0.21097 1.91082
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) -66.436541 0.821213 -80.90 <2e-16 ***
bathrooms 0.337090 0.003163 106.56 <2e-16 ***
view 0.172717 0.003178 54.34 <2e-16 ***
lat 1.655402 0.017270 95.86 <2e-16 ***
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
Residual standard error: 0.3517 on 21609 degrees of freedom
Multiple R-squared: 0.5542, Adjusted R-squared: 0.5541
F-statistic: 8955 on 3 and 21609 DF, p-value: < 2.2e-16
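Because the response is now log(price), the coefficients are no longer in dollars. A common reading is that a one-unit increase in a predictor multiplies the price by exp(coefficient). The sketch below applies that to the estimates printed above; the coefficient values are copied from the summary, and everything else is plain arithmetic.

```r
# Estimates copied from the summary of the log(price) model above
log_coefs <- c(bathrooms = 0.337090, view = 0.172717, lat = 1.655402)

# Multiplicative effect on price of a one-unit increase in each predictor
multiplier <- exp(log_coefs)

# The same effect expressed as a percentage change in price
pct_change <- (multiplier - 1) * 100
round(pct_change, 1)
```

So one extra bathroom goes with a price roughly 40% higher, holding view and lat fixed. The lat figure refers to a full degree of latitude, which is wider than the whole range in the data (47.16 to 47.78), so it is better read per tenth of a degree.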
par(mfrow = c(2,2))
plot(model_bath_view_lat_1)
Even with the log of price the residuals still follow a trend, but they look closer to normal. Let's finish by adding a categorical variable.
house_price %>%
add_residuals(model_bath_view_lat_1) %>%
select(waterfront, condition, grade, renovated, resid) %>%
ggpairs()
Grade looks like a great option, so let's use it.
model_bath_view_lat_grade <- lm(log(price) ~ bathrooms + view + lat + grade,
data = house_price)
summary(model_bath_view_lat_grade)
Call:
lm(formula = log(price) ~ bathrooms + view + lat + grade, data = house_price)
Residuals:
Min 1Q Median 3Q Max
-1.58146 -0.18860 -0.01772 0.17675 1.31907
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) -58.326621 0.754921 -77.262 < 2e-16 ***
bathrooms 0.124511 0.003531 35.267 < 2e-16 ***
view 0.125495 0.002726 46.038 < 2e-16 ***
lat 1.476731 0.014620 101.005 < 2e-16 ***
grade3 0.476419 0.340584 1.399 0.16188
grade4 0.237571 0.300003 0.792 0.42843
grade5 0.350229 0.295579 1.185 0.23607
grade6 0.510331 0.295049 1.730 0.08371 .
grade7 0.699234 0.295031 2.370 0.01780 *
grade8 0.897297 0.295086 3.041 0.00236 **
grade9 1.171552 0.295153 3.969 7.23e-05 ***
grade10 1.393909 0.295271 4.721 2.36e-06 ***
grade11 1.600347 0.295586 5.414 6.22e-08 ***
grade12 1.847965 0.296942 6.223 4.96e-10 ***
grade13 2.145218 0.306648 6.996 2.72e-12 ***
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
Residual standard error: 0.2949 on 21598 degrees of freedom
Multiple R-squared: 0.6866, Adjusted R-squared: 0.6864
F-statistic: 3380 on 14 and 21598 DF, p-value: < 2.2e-16
Grade is a very good choice. The problem is that the p-value is above 0.05 for grades 3, 4, 5 and 6. Let's check the ANOVA.
anova(model_bath_view_lat_1, model_bath_view_lat_grade)
Analysis of Variance Table
Model 1: log(price) ~ bathrooms + view + lat
Model 2: log(price) ~ bathrooms + view + lat + grade
Res.Df RSS Df Sum of Sq F Pr(>F)
1 21609 2672.6
2 21598 1878.8 11 793.73 829.48 < 2.2e-16 ***
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
The model with grade is clearly much better, so we should include it even though four of its levels have high p-values: the ANOVA p-value is far below 0.05 and the adjusted R-squared improves a lot (from 0.5541 to 0.6864). Let's check the residual plots.
par(mfrow = c(2,2))
plot(model_bath_view_lat_grade)
not plotting observations with leverage one:
19453
It isn't significantly better: the residual trend is weaker and the residuals look a little more normal, but not completely. The adjusted R-squared is 0.6864.
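One possible reason for the high p-values on the low grades: every grade coefficient is a comparison with the reference level, grade 1. The standard errors of all grade levels are close to the residual standard error (about 0.295), and the plot reports an observation with leverage one, which suggests that grade 1 contains very few houses. This is an interpretation, not something the output states directly. The sketch below uses simulated data (all values made up) to show how a reference level with a single observation inflates the standard errors, and how `relevel()` changes the comparison without changing the fit.

```r
set.seed(42)

# Hypothetical data: level "1" has one observation, the others have 100 each
grade <- factor(c("1", rep("7", 100), rep("8", 100)), levels = c("1", "7", "8"))
log_price <- c(12.0, rnorm(100, 13.0, 0.3), rnorm(100, 13.2, 0.3))
toy <- data.frame(grade, log_price)

# Reference level = "1": each coefficient is compared with a single house
fit_sparse <- lm(log_price ~ grade, data = toy)
se_sparse <- coef(summary(fit_sparse))[, "Std. Error"]

# Reference level = "7": comparisons are now against 100 houses
toy$grade_r <- relevel(toy$grade, ref = "7")
fit_relevel <- lm(log_price ~ grade_r, data = toy)
se_relevel <- coef(summary(fit_relevel))[, "Std. Error"]

se_sparse["grade8"]      # large: limited by the single reference observation
se_relevel["grade_r8"]   # much smaller

# The fitted values are identical, only the parameterisation differs
all.equal(fitted(fit_sparse), fitted(fit_relevel))
```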
model_final_1 <- lm(log(price) ~ bathrooms + view + lat + grade + bathrooms:view,
data = house_price)
summary(model_final_1)
Call:
lm(formula = log(price) ~ bathrooms + view + lat + grade + bathrooms:view,
data = house_price)
Residuals:
Min 1Q Median 3Q Max
-1.57636 -0.18872 -0.01809 0.17704 1.35638
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) -58.426449 0.754702 -77.417 < 2e-16 ***
bathrooms 0.129974 0.003680 35.322 < 2e-16 ***
view 0.163892 0.007829 20.933 < 2e-16 ***
lat 1.478831 0.014617 101.172 < 2e-16 ***
grade3 0.475465 0.340377 1.397 0.16246
grade4 0.229094 0.299824 0.764 0.44482
grade5 0.340950 0.295404 1.154 0.24844
grade6 0.501788 0.294873 1.702 0.08882 .
grade7 0.688239 0.294859 2.334 0.01960 *
grade8 0.883633 0.294918 2.996 0.00274 **
grade9 1.158244 0.294984 3.926 8.65e-05 ***
grade10 1.382626 0.295099 4.685 2.81e-06 ***
grade11 1.596951 0.295407 5.406 6.52e-08 ***
grade12 1.865077 0.296779 6.284 3.35e-10 ***
grade13 2.210788 0.306717 7.208 5.87e-13 ***
bathrooms:view -0.015061 0.002879 -5.231 1.70e-07 ***
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
Residual standard error: 0.2948 on 21597 degrees of freedom
Multiple R-squared: 0.687, Adjusted R-squared: 0.6868
F-statistic: 3160 on 15 and 21597 DF, p-value: < 2.2e-16
model_final_2 <- lm(log(price) ~ bathrooms + view + lat + grade + bathrooms:lat,
data = house_price)
summary(model_final_2)
Call:
lm(formula = log(price) ~ bathrooms + view + lat + grade + bathrooms:lat,
data = house_price)
Residuals:
Min 1Q Median 3Q Max
-1.59676 -0.18863 -0.01769 0.17679 1.33299
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) -67.465572 2.237132 -30.157 < 2e-16 ***
bathrooms 4.546040 1.018923 4.462 8.18e-06 ***
view 0.125564 0.002725 46.081 < 2e-16 ***
lat 1.669005 0.046656 35.772 < 2e-16 ***
grade3 0.505613 0.340510 1.485 0.13759
grade4 0.237212 0.299879 0.791 0.42894
grade5 0.348150 0.295457 1.178 0.23867
grade6 0.504813 0.294930 1.712 0.08698 .
grade7 0.690046 0.294917 2.340 0.01930 *
grade8 0.887993 0.294972 3.010 0.00261 **
grade9 1.163208 0.295038 3.943 8.09e-05 ***
grade10 1.386170 0.295155 4.696 2.66e-06 ***
grade11 1.594354 0.295467 5.396 6.88e-08 ***
grade12 1.842987 0.296821 6.209 5.43e-10 ***
grade13 2.155239 0.306530 7.031 2.11e-12 ***
bathrooms:lat -0.092936 0.021417 -4.339 1.43e-05 ***
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
Residual standard error: 0.2948 on 21597 degrees of freedom
Multiple R-squared: 0.6869, Adjusted R-squared: 0.6867
F-statistic: 3158 on 15 and 21597 DF, p-value: < 2.2e-16
model_final_3 <- lm(log(price) ~ bathrooms + view + lat + grade + bathrooms:grade,
data = house_price)
summary(model_final_3)
Call:
lm(formula = log(price) ~ bathrooms + view + lat + grade + bathrooms:grade,
data = house_price)
Residuals:
Min 1Q Median 3Q Max
-1.5886 -0.1879 -0.0178 0.1782 1.2944
Coefficients: (1 not defined because of singularities)
Estimate Std. Error t value Pr(>|t|)
(Intercept) -5.820e+01 7.544e-01 -77.145 < 2e-16 ***
bathrooms 1.403e-01 5.241e-02 2.677 0.007442 **
view 1.249e-01 2.724e-03 45.847 < 2e-16 ***
lat 1.474e+00 1.461e-02 100.862 < 2e-16 ***
grade3 2.165e-01 3.605e-01 0.601 0.548098
grade4 9.442e-02 3.872e-01 0.244 0.807357
grade5 2.508e-01 3.015e-01 0.832 0.405361
grade6 4.118e-01 2.950e-01 1.396 0.162777
grade7 7.160e-01 2.945e-01 2.431 0.015056 *
grade8 9.759e-01 2.948e-01 3.310 0.000934 ***
grade9 1.130e+00 2.960e-01 3.819 0.000134 ***
grade10 1.158e+00 2.973e-01 3.893 9.92e-05 ***
grade11 1.454e+00 3.017e-01 4.820 1.45e-06 ***
grade12 1.956e+00 3.147e-01 6.215 5.22e-10 ***
grade13 2.066e+00 4.057e-01 5.093 3.55e-07 ***
bathrooms:grade3 1.022e+00 4.835e-01 2.113 0.034588 *
bathrooms:grade4 1.424e-01 2.764e-01 0.515 0.606481
bathrooms:grade5 7.346e-02 7.671e-02 0.958 0.338267
bathrooms:grade6 6.351e-02 5.460e-02 1.163 0.244790
bathrooms:grade7 -2.486e-02 5.266e-02 -0.472 0.636786
bathrooms:grade8 -4.915e-02 5.290e-02 -0.929 0.352825
bathrooms:grade9 -1.662e-04 5.365e-02 -0.003 0.997528
bathrooms:grade10 6.303e-02 5.416e-02 1.164 0.244479
bathrooms:grade11 2.627e-02 5.554e-02 0.473 0.636241
bathrooms:grade12 -4.255e-02 5.887e-02 -0.723 0.469881
bathrooms:grade13 NA NA NA NA
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
Residual standard error: 0.2943 on 21588 degrees of freedom
Multiple R-squared: 0.6881, Adjusted R-squared: 0.6877
F-statistic: 1984 on 24 and 21588 DF, p-value: < 2.2e-16
The interactions of bathrooms with view, lat and grade raise the adjusted R-squared only marginally, from 0.6864 to 0.6868, 0.6867 and 0.6877 respectively.
model_final_4 <- lm(log(price) ~ bathrooms + view + lat + grade + view:lat,
data = house_price)
summary(model_final_4)
Call:
lm(formula = log(price) ~ bathrooms + view + lat + grade + view:lat,
data = house_price)
Residuals:
Min 1Q Median 3Q Max
-1.58237 -0.18897 -0.01757 0.17703 1.32343
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) -58.614435 0.777143 -75.423 < 2e-16 ***
bathrooms 0.124605 0.003531 35.289 < 2e-16 ***
view 1.688361 1.002498 1.684 0.09217 .
lat 1.482786 0.015127 98.022 < 2e-16 ***
grade3 0.477584 0.340574 1.402 0.16084
grade4 0.237804 0.299993 0.793 0.42796
grade5 0.349944 0.295569 1.184 0.23644
grade6 0.509979 0.295039 1.729 0.08391 .
grade7 0.698816 0.295022 2.369 0.01786 *
grade8 0.896920 0.295076 3.040 0.00237 **
grade9 1.171142 0.295144 3.968 7.27e-05 ***
grade10 1.393703 0.295262 4.720 2.37e-06 ***
grade11 1.600690 0.295576 5.415 6.18e-08 ***
grade12 1.848950 0.296932 6.227 4.85e-10 ***
grade13 2.150237 0.306655 7.012 2.42e-12 ***
view:lat -0.032860 0.021078 -1.559 0.11902
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
Residual standard error: 0.2949 on 21597 degrees of freedom
Multiple R-squared: 0.6866, Adjusted R-squared: 0.6864
F-statistic: 3155 on 15 and 21597 DF, p-value: < 2.2e-16
model_final_5 <- lm(log(price) ~ bathrooms + view + lat + grade + view:grade,
data = house_price)
summary(model_final_5)
Call:
lm(formula = log(price) ~ bathrooms + view + lat + grade + view:grade,
data = house_price)
Residuals:
Min 1Q Median 3Q Max
-1.57978 -0.18848 -0.01764 0.17700 1.33238
Coefficients: (2 not defined because of singularities)
Estimate Std. Error t value Pr(>|t|)
(Intercept) -58.381327 0.753768 -77.453 < 2e-16 ***
bathrooms 0.125520 0.003528 35.579 < 2e-16 ***
view 0.009631 0.050762 0.190 0.84952
lat 1.477882 0.014599 101.232 < 2e-16 ***
grade3 0.476393 0.339949 1.401 0.16112
grade4 0.212357 0.299948 0.708 0.47897
grade5 0.339713 0.295056 1.151 0.24960
grade6 0.504568 0.294501 1.713 0.08667 .
grade7 0.696312 0.294482 2.365 0.01806 *
grade8 0.890497 0.294538 3.023 0.00250 **
grade9 1.181441 0.294610 4.010 6.09e-05 ***
grade10 1.391084 0.294751 4.720 2.38e-06 ***
grade11 1.612001 0.295216 5.460 4.80e-08 ***
grade12 1.976621 0.297819 6.637 3.28e-11 ***
grade13 2.353857 0.319852 7.359 1.92e-13 ***
view:grade3 NA NA NA NA
view:grade4 0.292343 0.135960 2.150 0.03155 *
view:grade5 0.184996 0.059117 3.129 0.00175 **
view:grade6 0.164451 0.052497 3.133 0.00174 **
view:grade7 0.126383 0.051165 2.470 0.01351 *
view:grade8 0.133667 0.051002 2.621 0.00878 **
view:grade9 0.086454 0.051075 1.693 0.09053 .
view:grade10 0.115438 0.051245 2.253 0.02429 *
view:grade11 0.101042 0.051819 1.950 0.05120 .
view:grade12 0.034031 0.053944 0.631 0.52814
view:grade13 NA NA NA NA
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
Residual standard error: 0.2944 on 21589 degrees of freedom
Multiple R-squared: 0.6879, Adjusted R-squared: 0.6876
F-statistic: 2069 on 23 and 21589 DF, p-value: < 2.2e-16
model_final_6 <- lm(log(price) ~ bathrooms + view + lat + grade + lat:grade,
data = house_price)
summary(model_final_6)
Call:
lm(formula = log(price) ~ bathrooms + view + lat + grade + lat:grade,
data = house_price)
Residuals:
Min 1Q Median 3Q Max
-1.59526 -0.18978 -0.01655 0.17638 1.33769
Coefficients: (1 not defined because of singularities)
Estimate Std. Error t value Pr(>|t|)
(Intercept) 6.948e+01 5.970e+01 1.164 0.2445
bathrooms 1.253e-01 3.534e-03 35.444 < 2e-16 ***
view 1.257e-01 2.726e-03 46.113 < 2e-16 ***
lat -1.212e+00 1.256e+00 -0.965 0.3345
grade3 -1.472e+02 8.258e+01 -1.783 0.0746 .
grade4 -9.544e+01 6.474e+01 -1.474 0.1405
grade5 -1.124e+02 6.015e+01 -1.869 0.0616 .
grade6 -1.277e+02 5.975e+01 -2.138 0.0325 *
grade7 -1.309e+02 5.971e+01 -2.192 0.0284 *
grade8 -1.220e+02 5.971e+01 -2.044 0.0410 *
grade9 -1.258e+02 5.974e+01 -2.107 0.0352 *
grade10 -1.245e+02 5.984e+01 -2.080 0.0375 *
grade11 -1.076e+02 6.026e+01 -1.786 0.0741 .
grade12 -1.194e+02 6.251e+01 -1.911 0.0560 .
grade13 2.419e+00 3.327e-01 7.273 3.65e-13 ***
lat:grade3 3.109e+00 1.741e+00 1.786 0.0741 .
lat:grade4 2.012e+00 1.362e+00 1.477 0.1396
lat:grade5 2.372e+00 1.265e+00 1.875 0.0609 .
lat:grade6 2.698e+00 1.257e+00 2.146 0.0319 *
lat:grade7 2.768e+00 1.256e+00 2.204 0.0275 *
lat:grade8 2.587e+00 1.256e+00 2.059 0.0395 *
lat:grade9 2.672e+00 1.257e+00 2.126 0.0335 *
lat:grade10 2.648e+00 1.259e+00 2.104 0.0354 *
lat:grade11 2.299e+00 1.268e+00 1.813 0.0698 .
lat:grade12 2.552e+00 1.315e+00 1.941 0.0523 .
lat:grade13 NA NA NA NA
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
Residual standard error: 0.2947 on 21588 degrees of freedom
Multiple R-squared: 0.6872, Adjusted R-squared: 0.6869
F-statistic: 1977 on 24 and 21588 DF, p-value: < 2.2e-16
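Adjusted R-squared and interaction p-values, against a baseline of 0.6864 for the model without interactions:
model_final_1 (bathrooms:view): 0.6868, p-value < 0.05
model_final_2 (bathrooms:lat): 0.6867, p-value < 0.05
model_final_3 (bathrooms:grade): 0.6877, p-value > 0.05 for 9 of the 10 estimated interaction terms
model_final_4 (view:lat): 0.6864, p-value > 0.05
model_final_5 (view:grade): 0.6876, p-value > 0.05 for 3 of the 9 estimated interaction terms
model_final_6 (lat:grade): 0.6869, p-value > 0.05 for 5 of the 10 estimated interaction terms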
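I choose the interaction bathrooms:lat (model_final_2): its p-value is well below 0.05 and it adds a little to the adjusted R-squared. Note that bathrooms:view (model_final_1) has an even smaller p-value and a marginally higher adjusted R-squared, so it would be an equally defensible choice.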
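Collecting these numbers by hand is error-prone, so a comparison like this can be built in a loop. The sketch below does that on mtcars with three arbitrary numeric interactions; the formula and variable names are stand-ins, not the house model. It reads the p-value of a single interaction coefficient, so it would need adapting for interactions with a factor such as grade, which produce several coefficients.

```r
# Stand-in baseline model and candidate interactions
base_formula <- mpg ~ wt + hp + qsec
interactions <- c("wt:hp", "wt:qsec", "hp:qsec")

# Fit one model per candidate and keep adjusted R-squared and the term's p-value
rows <- lapply(interactions, function(term) {
  new_formula <- update(base_formula, as.formula(paste(". ~ . +", term)))
  fit <- lm(new_formula, data = mtcars)
  data.frame(
    interaction = term,
    adj_r2 = summary(fit)$adj.r.squared,
    p_value = coef(summary(fit))[term, "Pr(>|t|)"]
  )
})
comparison <- do.call(rbind, rows)

# Baseline adjusted R-squared for reference, then candidates sorted best first
summary(lm(base_formula, data = mtcars))$adj.r.squared
comparison[order(-comparison$adj_r2), ]
```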
house_resid <- house_price %>%
add_residuals(model_final_2) %>%
select(-price)
coplot(resid ~ lat | bathrooms, data = house_resid)
calc.relimp(model_bath_view_lat_grade, type = "lmg", rela = TRUE)
Response variable: log(price)
Total response variance: 0.2773966
Analysis based on 21613 observations
14 Regressors:
Some regressors combined in groups:
Group grade : grade3 grade4 grade5 grade6 grade7 grade8 grade9 grade10 grade11 grade12 grade13
Relative importance of 4 (groups of) regressors assessed:
grade bathrooms view lat
Proportion of variance explained by model: 68.66%
Metrics are normalized to sum to 100% (rela=TRUE).
Relative importance metrics:
lmg
grade 0.43966938
bathrooms 0.21589254
view 0.09505446
lat 0.24938362
Average coefficients for different model sizes:
1group 2groups 3groups 4groups
bathrooms 0.3766719 0.2728513 0.1888339 0.1245107
view 0.2381620 0.1775130 0.1406023 0.1254952
lat 1.7073235 1.5941704 1.5201346 1.4767310
grade3 0.2177137 0.3022883 0.3886477 0.4764191
grade4 0.3139692 0.2939256 0.2684548 0.2375714
grade5 0.4597465 0.4301088 0.3935942 0.3502286
grade6 0.6802194 0.6318374 0.5751850 0.5103313
grade7 0.9729065 0.8941075 0.8028436 0.6992343
grade8 1.2724337 1.1635357 1.0384146 0.8972972
grade9 1.6236691 1.4914933 1.3406726 1.1715522
grade10 1.9390314 1.7785252 1.5966525 1.3939094
grade11 2.2676113 2.0700867 1.8474267 1.6003474
grade12 2.6454340 2.4080746 2.1419093 1.8479647
grade13 3.1642455 2.8611314 2.5210458 2.1452176
Grade is the most important predictor, with about 44% of the explained variance, followed by lat (about 25%), bathrooms (about 22%) and view (about 10%).
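To show what the lmg metric measures, here is a sketch of the idea for just two predictors, on mtcars rather than the house data. The lmg share of a predictor is the R-squared it adds, averaged over every order in which the predictors can enter the model. With two predictors there are only two orders, so it can be written out directly; calc.relimp does the same averaging for larger models.

```r
# Helper: R-squared of a model fitted on the stand-in data
r2 <- function(formula) summary(lm(formula, data = mtcars))$r.squared

r2_wt   <- r2(mpg ~ wt)
r2_hp   <- r2(mpg ~ hp)
r2_both <- r2(mpg ~ wt + hp)

# Average contribution of each predictor: entering first vs entering second
lmg_wt <- mean(c(r2_wt, r2_both - r2_hp))
lmg_hp <- mean(c(r2_hp, r2_both - r2_wt))

# The contributions add up to the full model's R-squared
all.equal(lmg_wt + lmg_hp, r2_both)

# Dividing by the total gives shares that sum to 1, as with rela = TRUE
shares <- c(wt = lmg_wt, hp = lmg_hp) / r2_both
shares
```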